We have here a dataset covering a representative portion of the apartments in the world's biggest cities. We found this dataset on public.opendatasoft.com; since no further study had been done on it, we chose to do one ourselves.
Let's analyse the following situation: someone owns an apartment in Paris and has collected some data on it. He or she wants to know what one night's rental should cost on Airbnb, according to the data on other apartments.
That is what this study is about. Using the dataset, we are going to try to predict the rent from other features such as the district, the number of rooms or the responsiveness of the owner on Airbnb.
We are going to use the following libraries: Pandas and NumPy for data science, Matplotlib, seaborn and plotly for data visualisation, and last but not least, scikit-learn for machine learning.
import pandas as pd
import numpy as np
df = pd.read_csv('airbnb-listings.csv', delimiter=";")
df.head()
| | ID | Listing Url | Scrape ID | Last Scraped | Name | Summary | Space | Description | Experiences Offered | Neighborhood Overview | ... | Review Scores Communication | Review Scores Location | Review Scores Value | License | Jurisdiction Names | Cancellation Policy | Calculated host listings count | Reviews per Month | Geolocation | Features |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4008728 | https://www.airbnb.com/rooms/4008728 | 20170402075052 | 2017-04-02 | Luxurious 3 bedroom, centrum, 180m2 | This luxurious apartment is situated in the c... | In the heart of hip & lively Amsterdam Centrum... | This luxurious apartment is situated in the c... | none | There is a wide variety of cafes and restauran... | ... | 10.0 | 10.0 | 9.0 | NaN | Amsterdam | strict | 1.0 | 1.52 | 52.36523660409065,4.878249575101092 | Host Has Profile Pic,Host Identity Verified,Is... |
| 1 | 7778612 | https://www.airbnb.com/rooms/7778612 | 20170402075052 | 2017-04-02 | Luxury apartment in city centre | Beautiful, large (105m2, 2 floors) and quiet a... | 1 living room 1 dining room 2 bedrooms 1 bathr... | Beautiful, large (105m2, 2 floors) and quiet a... | none | Within a few minutes walking you'll be in the ... | ... | 10.0 | 10.0 | 9.0 | NaN | Amsterdam | strict | 1.0 | 0.81 | 52.367309030897516,4.8738407410699915 | Host Is Superhost,Host Has Profile Pic,Host Id... |
| 2 | 8264596 | https://www.airbnb.com/rooms/8264596 | 20170402075052 | 2017-04-02 | Cosy apartment across Vondelpark | This central located apartment will make you f... | NaN | This central located apartment will make you f... | none | NaN | ... | 10.0 | 10.0 | 10.0 | NaN | Amsterdam | flexible | 1.0 | 0.05 | 52.361943953804776,4.866686522141489 | Host Has Profile Pic,Host Identity Verified,Is... |
| 3 | 2180729 | https://www.airbnb.com/rooms/2180729 | 20170402075052 | 2017-04-02 | Spacious City Apartment Oud-West | Living like an Amsterdam resident in this ligh... | There are 2 bedrooms each with a double bed. I... | Living like an Amsterdam resident in this ligh... | none | The 85 m2 apartment is located on a quiet stre... | ... | 9.0 | 9.0 | 9.0 | NaN | Amsterdam | flexible | 1.0 | 0.62 | 52.37014616115532,4.866281767968382 | Host Has Profile Pic,Host Identity Verified,Is... |
| 4 | 14463171 | https://www.airbnb.com/rooms/14463171 | 20170402075052 | 2017-04-02 | Cosy Studio Apartment Center Amsterdam | Studio Apartment Centre Amsterdam Perfect stud... | NaN | Studio Apartment Centre Amsterdam Perfect stud... | none | NaN | ... | 8.0 | 9.0 | 9.0 | NaN | Amsterdam | moderate | 1.0 | 2.04 | 52.368817829014425,4.871249476925455 | Host Has Profile Pic |
5 rows × 89 columns
Here is the list of columns in the original DataFrame:
df.columns.values
array(['ID', 'Listing Url', 'Scrape ID', 'Last Scraped', 'Name',
'Summary', 'Space', 'Description', 'Experiences Offered',
'Neighborhood Overview', 'Notes', 'Transit', 'Access',
'Interaction', 'House Rules', 'Thumbnail Url', 'Medium Url',
'Picture Url', 'XL Picture Url', 'Host ID', 'Host URL',
'Host Name', 'Host Since', 'Host Location', 'Host About',
'Host Response Time', 'Host Response Rate', 'Host Acceptance Rate',
'Host Thumbnail Url', 'Host Picture Url', 'Host Neighbourhood',
'Host Listings Count', 'Host Total Listings Count',
'Host Verifications', 'Street', 'Neighbourhood',
'Neighbourhood Cleansed', 'Neighbourhood Group Cleansed', 'City',
'State', 'Zipcode', 'Market', 'Smart Location', 'Country Code',
'Country', 'Latitude', 'Longitude', 'Property Type', 'Room Type',
'Accommodates', 'Bathrooms', 'Bedrooms', 'Beds', 'Bed Type',
'Amenities', 'Square Feet', 'Price', 'Weekly Price',
'Monthly Price', 'Security Deposit', 'Cleaning Fee',
'Guests Included', 'Extra People', 'Minimum Nights',
'Maximum Nights', 'Calendar Updated', 'Has Availability',
'Availability 30', 'Availability 60', 'Availability 90',
'Availability 365', 'Calendar last Scraped', 'Number of Reviews',
'First Review', 'Last Review', 'Review Scores Rating',
'Review Scores Accuracy', 'Review Scores Cleanliness',
'Review Scores Checkin', 'Review Scores Communication',
'Review Scores Location', 'Review Scores Value', 'License',
'Jurisdiction Names', 'Cancellation Policy',
'Calculated host listings count', 'Reviews per Month',
'Geolocation', 'Features'], dtype=object)
len(df.axes[1])
89
len(df.axes[0])
494954
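As a side note, the same row and column counts can be obtained in one call with `df.shape`; a minimal sketch on a toy frame (the values here are made up):

```python
import pandas as pd

# Toy frame standing in for the Airbnb data (values are made up)
demo = pd.DataFrame({'City': ['Paris', 'London'], 'Price': [80.0, 95.0]})

# shape returns (n_rows, n_columns) in one call,
# equivalent to (len(demo.axes[0]), len(demo.axes[1]))
n_rows, n_cols = demo.shape
print(n_rows, n_cols)  # 2 2
```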
We will extract only the columns useful for this study.
df = df.filter(['Longitude','Latitude','Space','Accommodates','City','Host Response Time','Host Verifications','Street','Property Type','Room Type','Bathrooms','Bedrooms','Beds','Price','Cleaning Fee','Review Scores Rating','Cancellation Policy'])
We have to drop rows with missing data to obtain a complete dataset.
df =df.dropna()
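Before dropping, it can be worth checking how many values are actually missing per column. A small sketch on hypothetical data, using `isna().sum()`:

```python
import numpy as np
import pandas as pd

# Toy frame with missing values (values are made up)
demo = pd.DataFrame({'Price': [80.0, np.nan, 60.0],
                     'Bedrooms': [1.0, 2.0, np.nan]})

# count NaNs per column before deciding to drop
missing = demo.isna().sum()
print(int(missing.sum()))  # 2

# dropna() removes every row containing at least one NaN
clean = demo.dropna()
print(len(clean))  # 1
```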
The dataset also contains cities other than Paris, so we need to keep only the relevant rows.
# Check the city names
df['City'].value_counts()
Paris 17954
London 15581
Los Angeles 9008
Barcelona 7316
New York 7056
...
Royal Arsenal Riverside 1
Streatham Hill, London, England, GB 1
Palmers Green 1
Leyton 1
Афины 1
Name: City, Length: 3087, dtype: int64
# Keep only Paris
df.drop( df[ df['City'] != "Paris" ].index, inplace=True)
The study concerns only apartments, so we keep only the relevant rows by filtering on the Property Type column.
df['Property Type'].value_counts()
Apartment             17249
Loft                    271
House                   197
Bed & Breakfast          80
Condominium              42
Other                    35
Townhouse                25
Guesthouse               22
Boat                     21
Boutique hotel            2
Serviced apartment        2
Earth House               2
Treehouse                 1
Timeshare                 1
Hostel                    1
Tipi                      1
Camper/RV                 1
Name: Property Type, dtype: int64
df.drop( df[ df['Property Type'] != "Apartment" ].index, inplace=True)
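Both filters above can also be written as boolean indexing, which avoids the `drop(... .index, inplace=True)` round-trip. A minimal sketch on made-up rows:

```python
import pandas as pd

# Toy rows mixing cities and property types (values are made up)
demo = pd.DataFrame({'City': ['Paris', 'London', 'Paris'],
                     'Property Type': ['Apartment', 'Apartment', 'Loft']})

# boolean indexing keeps the matching rows directly
kept = demo[(demo['City'] == 'Paris') & (demo['Property Type'] == 'Apartment')]
print(len(kept))  # 1
```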
We need to check whether the street names are consistent.
df['Street'].value_counts()
Montmartre, Paris, Île-de-France 75018, France 956
Saint-Germain-des-Prés - Odéon, Paris, Île-de-France 75006, France 614
Batignolles, Paris, Île-de-France 75017, France 485
Bastille, Paris, Île-de-France 75011, France 460
République, Paris, Île-de-France 75011, France 457
...
Paris, Île de France 75010, France 1
Gare du Nord - Gare de I'Est, Paris, Ile-de-France 75010, France 1
Paris, Ãle-de-France 75019, France 1
Paris, Ile de France 75019, France 1
Champs-Elysées, Paris, Ile-de-France 75008, France 1
Name: Street, Length: 413, dtype: int64
In order to work with relevant numeric features, we extract only the ZIP code from the Street column.
df['Street'] = df['Street'].str.extract(r'(\d+)')
df =df.dropna()
df['Street'] = df['Street'].astype(int)
df.drop( df[ (df['Street']< 75001) | (df['Street']> 75020) ].index, inplace=True)
df['Street'] = df['Street'] - 75000
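The whole ZIP-to-district transformation can be checked on a couple of sample strings (the addresses below are illustrative):

```python
import pandas as pd

# Sample strings shaped like the Street column (addresses are illustrative)
streets = pd.Series(['Montmartre, Paris, Île-de-France 75018, France',
                     'Bastille, Paris, Île-de-France 75011, France'])

# the capture group grabs the first run of digits (the ZIP code)
zipcode = streets.str.extract(r'(\d+)')[0].astype(int)
district = zipcode - 75000
print(district.tolist())  # [18, 11]
```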
Now the Street column holds the district number, which is more useful for a study using linear regression.
df['Street'].value_counts()
18    1739
11    1588
10    1201
15    1111
3      994
17     977
4      900
6      822
5      797
9      771
2      742
20     711
7      656
19     627
14     567
8      525
1      513
12     511
13     428
16     411
Name: Street, dtype: int64
Host Response Time values are :
df['Host Response Time'].value_counts()
within an hour        8207
within a few hours    4259
within a day          3656
a few days or more     469
Name: Host Response Time, dtype: int64
Room Type values are :
df['Room Type'].value_counts()
Entire home/apt    15406
Private room        1115
Shared room           70
Name: Room Type, dtype: int64
Cancellation Policy values are :
df['Cancellation Policy'].value_counts()
strict             8511
moderate           5586
flexible           2492
super_strict_30       2
Name: Cancellation Policy, dtype: int64
So each of these columns contains a small set of categories.
We have to convert them into numeric values, since, again, regression doesn't deal with strings.
For each concerned column, we create a dictionary that maps each category to an integer.
response_type = {
'within an hour' : 1,
'within a few hours' : 2,
'within a day' : 3,
'a few days or more' : 4,
}
nw_list_response = []
for i in df['Host Response Time']:
nw_list_response.append(response_type[i])
df['Host Response Time'] = nw_list_response
room_type = {
'Entire home/apt' : 1,
'Private room' : 2,
'Shared room' : 3,
}
nw_list_room = []
for i in df['Room Type']:
nw_list_room.append(room_type[i])
df['Room Type'] = nw_list_room
cancel_policy = {
'strict' : 1,
'flexible' : 2,
'moderate' : 3,
'super_strict_30' : 4,
}
nw_list_cancelpolicy= []
for i in df['Cancellation Policy']:
nw_list_cancelpolicy.append(cancel_policy[i])
df['Cancellation Policy'] = nw_list_cancelpolicy
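The three loops above can each be replaced by a single `Series.map` call, which applies the dictionary lookup to every value at once. A sketch with the cancellation-policy mapping on toy rows:

```python
import pandas as pd

# Toy column with the same categories (rows are illustrative)
demo = pd.DataFrame({'Cancellation Policy': ['strict', 'flexible', 'moderate']})

cancel_policy = {'strict': 1, 'flexible': 2, 'moderate': 3, 'super_strict_30': 4}

# Series.map applies the dictionary lookup vectorised,
# replacing the explicit Python loop
demo['Cancellation Policy'] = demo['Cancellation Policy'].map(cancel_policy)
print(demo['Cancellation Policy'].tolist())  # [1, 2, 3]
```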
Here is an example of the content of the 'Space' column:
df['Space'].iloc[2]
'This charming 25m2 studio is very bright and colourful. With a bohemian feel, has everything for 2 guests to feel comfortable and cozy. On a 1st floor without elevator, on a great part of Paris, easy access to Metro Stations and bus stops at the doorstep. Supermarket and bakery just across the street and fabulous restaurants all around in less than 2 minutes walk.'
Our strategy for handling this kind of string is the following:
df=df[df["Space"].str.contains("m2")==True]
df['Surface'] = df.Space.str.split("m2", expand=True)[0]
df['Surface'] = df['Surface'].str.split().str[-1]
df['Surface'] = df['Surface'].str.extract(r'(\d+)')
So now we have a new column containing only the surface; we don't need the Space column anymore.
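The extraction chain can be verified on sample descriptions (the text below is illustrative):

```python
import pandas as pd

# Sample descriptions mimicking the Space column (text is illustrative)
space = pd.Series(['This charming 25m2 studio is very bright',
                   'A 430 sq ft / 40 m2, one bedroom apartment'])

surface = space.str.split('m2', expand=True)[0]  # text before the first 'm2'
surface = surface.str.split().str[-1]            # last whitespace-separated token
surface = surface.str.extract(r'(\d+)')[0]       # keep only the digits
print(surface.tolist())  # ['25', '40']
```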
df['Host Verifications'].value_counts()
email,phone,reviews,jumio 879
email,phone,reviews 671
email,phone,facebook,reviews,jumio 219
email,phone,reviews,jumio,government_id 178
email,phone,facebook,reviews 141
...
email,phone,facebook,amex,reviews,kba,work_email 1
email,phone,linkedin,reviews,jumio,offline_government_id,government_id,work_email 1
email,phone,facebook,google,reviews,jumio 1
email,phone,google,reviews,jumio,offline_government_id,government_id,work_email 1
email,phone,facebook,linkedin,reviews,jumio,work_email 1
Name: Host Verifications, Length: 80, dtype: int64
The "Host Verifications" column is a list of items the host has verified in his or her profile, such as the ID, the address or the Facebook account.
To work with integers, we will replace this column by the number of verifications done by the host, counting each verification.
df['Host Verifications']= df['Host Verifications'].str.replace(',', ' ')
df['Host Verifications'] = df['Host Verifications'].apply(lambda n: len(n.split()))
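An equivalent, slightly shorter way is to count the separators: a comma-separated list with n commas contains n + 1 items. A sketch on sample strings:

```python
import pandas as pd

# Sample verification lists (values are illustrative)
verif = pd.Series(['email,phone,reviews',
                   'email,phone,facebook,reviews,jumio'])

# a list with n commas contains n + 1 items
n_verifications = verif.str.count(',') + 1
print(n_verifications.tolist())  # [3, 5]
```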
Now, instead of a string of characters, we have a kind of verification score.
df['Host Verifications'].value_counts()
4    1066
3     681
5     590
6     238
7      73
2      28
8      12
9       4
1       3
Name: Host Verifications, dtype: int64
Let's check the DataFrame now.
df.head()
| | Longitude | Latitude | Space | Accommodates | City | Host Response Time | Host Verifications | Street | Property Type | Room Type | Bathrooms | Bedrooms | Beds | Price | Cleaning Fee | Review Scores Rating | Cancellation Policy | Surface |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24345 | 2.365122 | 48.855027 | This charming 25m2 studio is very bright and c... | 2.0 | Paris | 2 | 4 | 4 | Apartment | 1 | 1.0 | 0.0 | 1.0 | 80.0 | 40.0 | 94.0 | 1 | 25 |
| 24352 | 2.357037 | 48.857787 | 28m2, 2 Separated Mezzanines, one kitchen, one... | 4.0 | Paris | 3 | 4 | 4 | Apartment | 1 | 1.0 | 2.0 | 2.0 | 94.0 | 26.0 | 80.0 | 1 | 28 |
| 24387 | 2.353065 | 48.858664 | A 430 sq ft / 40 m2, one bedroom apartment is ... | 4.0 | Paris | 3 | 4 | 4 | Apartment | 1 | 1.0 | 1.0 | 1.0 | 134.0 | 50.0 | 98.0 | 1 | 40 |
| 24397 | 2.355115 | 48.852292 | Hello future friends, I would like to describe... | 2.0 | Paris | 1 | 3 | 4 | Apartment | 1 | 1.0 | 1.0 | 1.0 | 135.0 | 45.0 | 96.0 | 1 | 32 |
| 24563 | 2.353662 | 48.890418 | Duplex full of light in the middle of the Gout... | 2.0 | Paris | 4 | 5 | 18 | Apartment | 1 | 1.0 | 1.0 | 1.0 | 60.0 | 20.0 | 91.0 | 3 | 45 |
len(df)
2695
df.dtypes
Longitude               float64
Latitude                float64
Space                    object
Accommodates            float64
City                     object
Host Response Time        int64
Host Verifications        int64
Street                    int32
Property Type            object
Room Type                 int64
Bathrooms               float64
Bedrooms                float64
Beds                    float64
Price                   float64
Cleaning Fee            float64
Review Scores Rating    float64
Cancellation Policy       int64
Surface                  object
dtype: object
We drop the columns that are no longer relevant.
df = df.drop(columns=['City', 'Property Type','Space'], axis=1)
We rename the transformed columns with simpler and more accurate names.
df = df.rename({'Host Verifications': 'Host Credibility Degree', 'Street': 'District','Review Scores Rating':'Review'}, axis=1)
We reset the index to obtain a cleaned, complete dataset ready to be used.
df = df.dropna()
df = df.reset_index(drop=True)
The dataset at the end of the cleaning step :
df
| | Longitude | Latitude | Accommodates | Host Response Time | Host Credibility Degree | District | Room Type | Bathrooms | Bedrooms | Beds | Price | Cleaning Fee | Review | Cancellation Policy | Surface |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.365122 | 48.855027 | 2.0 | 2 | 4 | 4 | 1 | 1.0 | 0.0 | 1.0 | 80.0 | 40.0 | 94.0 | 1 | 25 |
| 1 | 2.357037 | 48.857787 | 4.0 | 3 | 4 | 4 | 1 | 1.0 | 2.0 | 2.0 | 94.0 | 26.0 | 80.0 | 1 | 28 |
| 2 | 2.353065 | 48.858664 | 4.0 | 3 | 4 | 4 | 1 | 1.0 | 1.0 | 1.0 | 134.0 | 50.0 | 98.0 | 1 | 40 |
| 3 | 2.355115 | 48.852292 | 2.0 | 1 | 3 | 4 | 1 | 1.0 | 1.0 | 1.0 | 135.0 | 45.0 | 96.0 | 1 | 32 |
| 4 | 2.353662 | 48.890418 | 2.0 | 4 | 5 | 18 | 1 | 1.0 | 1.0 | 1.0 | 60.0 | 20.0 | 91.0 | 3 | 45 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2676 | 2.387538 | 48.848758 | 8.0 | 1 | 4 | 12 | 1 | 2.0 | 4.0 | 6.0 | 229.0 | 70.0 | 97.0 | 1 | 150 |
| 2677 | 2.389579 | 48.840585 | 2.0 | 3 | 5 | 12 | 1 | 1.0 | 1.0 | 1.0 | 45.0 | 30.0 | 96.0 | 3 | 27 |
| 2678 | 2.302691 | 48.880195 | 8.0 | 1 | 4 | 17 | 1 | 2.0 | 3.0 | 4.0 | 160.0 | 90.0 | 78.0 | 1 | 85 |
| 2679 | 2.306053 | 48.857639 | 4.0 | 1 | 6 | 7 | 1 | 1.0 | 1.0 | 2.0 | 155.0 | 35.0 | 97.0 | 1 | 40 |
| 2680 | 2.326907 | 48.851534 | 4.0 | 3 | 4 | 6 | 1 | 1.0 | 1.0 | 3.0 | 136.0 | 117.0 | 100.0 | 1 | 45 |
2681 rows × 15 columns
We import the useful libraries for data visualisation
import seaborn as sns
import matplotlib.pyplot as plt
An interesting point: we can recognize the shape of Paris.
df.plot(kind="scatter",
x="Longitude", y="Latitude",
alpha=0.4, figsize=(14,7),
c="Price", cmap="gist_heat_r",
colorbar=True, sharex=False);
import plotly.express as px
fig = px.scatter_geo(df, lat='Latitude', lon='Longitude', hover_name='Price',
                     color='Price', scope='europe', height=800, width=1000,
                     center=dict(lat=48.855, lon=2.325))
fig.update_layout(title='Airbnb listings in Paris', title_x=0.5)
fig.show()
df.boxplot(column =['Price'], grid = False)
df['Price'].describe()
count    2681.000000
mean      100.458038
std        66.847296
min        20.000000
25%        60.000000
50%        80.000000
75%       120.000000
max       800.000000
Name: Price, dtype: float64
We remove outliers to obtain more homogeneous data.
# Calculate the IQR
q1 = df['Price'].quantile(0.25)
q3 = df['Price'].quantile(0.75)
iqr = q3 - q1
# Calculate the lower and upper boundaries
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
# Filter the DataFrame to only include rows where 'col' is within the boundaries
df = df[(df['Price'] > lower_bound) & (df['Price'] < upper_bound)]
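The same IQR rule can be wrapped in a small reusable helper; the toy prices below (with one obvious outlier) are made up for illustration:

```python
import pandas as pd

def iqr_filter(frame, col, k=1.5):
    """Drop rows whose value in `col` lies outside [q1 - k*IQR, q3 + k*IQR]."""
    q1, q3 = frame[col].quantile(0.25), frame[col].quantile(0.75)
    iqr = q3 - q1
    return frame[(frame[col] > q1 - k * iqr) & (frame[col] < q3 + k * iqr)]

# Toy prices with one obvious outlier (values are made up)
toy = pd.DataFrame({'Price': [60, 80, 90, 100, 120, 800]})
print(len(iqr_filter(toy, 'Price')))  # 5
```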
sns.barplot(x="Room Type", y="Price", data=df)
plt.title('Price according to the type of room')
plt.show()
sns.barplot(x="District", y="Price", data=df)
plt.title('Price according to the district')
plt.show()
plt.scatter(x="Review", y="Price", data=df)
plt.title('Price according to the rating')
plt.show()
sns.barplot(x="Host Credibility Degree", y="Price", data=df)
plt.title('Price according to the degree of host verification')
plt.show()
sns.barplot(x="Host Response Time", y="Review", data=df)
plt.title('Review according to the Host Response Time')
plt.show()
This part will be dedicated to predictions with machine learning models. We will exclusively use regression models to try to predict the price of a given apartment in Paris.
First of all, let's get a copy of the original dataset.
df_ml = df.copy()
We know in advance that we want to predict the 'Price' feature, which is a float, so regression is the right tool for such a prediction. We have chosen to compare several regression models found in the official scikit-learn documentation.
We start by importing these models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import BayesianRidge
from sklearn.linear_model import ARDRegression
from sklearn.linear_model import TweedieRegressor
from sklearn.linear_model import OrthogonalMatchingPursuit
from sklearn.linear_model import TheilSenRegressor
from sklearn.linear_model import HuberRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
Here is the list of all the models listed above, properly configured.
models = [LinearRegression(),
          Lasso(alpha=0.1),
          Ridge(alpha=1.2),
          ElasticNet(alpha=0.01),
          BayesianRidge(),
          ARDRegression(),
          OrthogonalMatchingPursuit(),
          TweedieRegressor(max_iter=300),
          TheilSenRegressor(),
          HuberRegressor(max_iter=3000),
          RandomForestRegressor(max_depth=8),
          linear_model.Lars(n_nonzero_coefs=7, normalize=False),
          GradientBoostingRegressor(random_state=3),
          AdaBoostRegressor(random_state=0, n_estimators=100),
          BaggingRegressor()]
models_name = ['LinearRegression', 'Lasso', 'Ridge', 'ElasticNet', 'BayesianRidge',
               'ARDRegression', 'OrthogonalMatchingPursuit', 'TweedieRegressor',
               'TheilSenRegressor', 'HuberRegressor', 'RandomForestRegressor', 'LARS',
               'GradientBoostingRegressor', 'AdaBoostRegressor', 'BaggingRegressor']
What we want to do now is try all the models and find which one performs best at predicting the 'Price' feature. For that, we have two indicators: the r-squared (r2) and the rmse (root mean squared error).
Let's calculate these two scores for all the models and put them in DataFrames.
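As a quick reminder of what these two indicators measure, here is a toy computation on hand-made arrays (the numbers are illustrative): r2 is 1 − SS_res/SS_tot, and the rmse is expressed in the same units as the target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy true/predicted prices (values are illustrative)
y_true = np.array([100.0, 80.0, 120.0, 60.0])
y_pred = np.array([110.0, 75.0, 115.0, 65.0])

r2 = r2_score(y_true, y_pred)                       # 1 - SS_res / SS_tot
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # same units as the target

print(round(r2, 4), round(rmse, 2))  # 0.9125 6.61
```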
values_r2 = []   #the list that will store the r-squared values
values_root = [] #the list that will store the rmse values

y = df_ml['Price'].values              #y is the feature we want to predict
x = df_ml.drop('Price', axis=1).values #x holds the features used to predict y
#the split is identical for every model, so we compute it once
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=42)

for reg in models: #we iterate over each model in the list
    reg.fit(X_train, y_train) #we fit the model on the training set
    y_pred = reg.predict(X_test)
    values_r2.append(reg.score(X_test, y_test)) #r-squared on the test set
    rmse = np.sqrt(mean_squared_error(y_test, y_pred)) #we calculate the rmse
    values_root.append(rmse) #and then put it into the rmse list
The aim of creating these lists is to plot the results, so we can have a visual comparison of the models. For plotting with seaborn, it is convenient to put the scores into DataFrames.
#we create dataframes with the lists
r2 = pd.DataFrame(values_r2, columns=['r2'])
r2['model'] = models_name
rt = pd.DataFrame(values_root, columns=['RootMeanSquaredError'])
rt['model'] = models_name
Here are the r-squared and rmse values for each regression model:
sns.barplot(x="r2", y="model",data=r2, order=r2.sort_values(['r2']).model)
plt.xlim(0,0.95)
plt.title('r-squared value according to the regression model used')
plt.show()
sns.barplot(x="RootMeanSquaredError", y="model",data=rt, order=rt.sort_values(['RootMeanSquaredError']).model.iloc[::-1])
plt.xlim(0.1,100)
plt.title('rmse value according to the regression model used')
plt.show()
We see here that some models are very efficient, while others score poorly. Among the best models (those with the highest r2 and the lowest rmse) is GradientBoostingRegressor.
Let's use the most efficient model to check whether the prediction is really good. To do that, we are going to calculate the percentage error of the prediction (the relative difference between the predicted value and the actual value).
y = df_ml['Price'].values
x = df_ml.drop('Price', axis=1).values
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state=42)
reg = GradientBoostingRegressor()
reg.fit(X_train, y_train)
pred = reg.predict(x)
#the error is the relative difference between the predicted value and the actual value
error = (abs(df_ml['Price'] - pred))/df_ml['Price']*100
The 'error' Series contains the percentage error for each apartment. Let's calculate its mean.
error_mean = sum(error)/len(error)
error_mean
6.671104921196848
The prediction seems quite good, according to the data above.
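Note, however, that this error is computed over the full dataset, including rows the model was trained on, so it is likely optimistic. A held-out version of the same check, sketched on synthetic data (names and numbers are illustrative; results will differ on the real dataset):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset (illustrative numbers)
rng = np.random.RandomState(42)
xs = rng.rand(500, 3)
ys = 50 + 100 * xs[:, 0] + 20 * xs[:, 1] + rng.normal(0, 5, 500)

xs_train, xs_test, ys_train, ys_test = train_test_split(
    xs, ys, test_size=0.3, random_state=42)

reg_demo = GradientBoostingRegressor().fit(xs_train, ys_train)

# the error is measured on held-out rows only: the model never saw them
pred_test = reg_demo.predict(xs_test)
error_test = np.abs(ys_test - pred_test) / np.abs(ys_test) * 100
print(round(error_test.mean(), 1))
```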
This concludes our study of a machine learning case involving Airbnb apartments. The study allowed us not only to improve our machine learning skills but also to understand how machine learning actually works and how features are taken into account by the different models.
Of course our study isn't perfect, and some points could still be improved. Here is a list of changes we might consider.